Big Data and Automated Content Analysis Final Project

Analysing the lyrics of the greatest hip hop artists of all time

Lukas Pechacek

11954922

03/06/2022

Damian Trilling

University of Amsterdam

Data collection

The list of artists was retrieved from the website ranker.com. The BeautifulSoup package was used to scrape the top 25 hip hop artists as well as the top 25 hip hop bands. The two lists were then joined and saved as a text file. Afterwards, the Genius API was used to retrieve the top 90 songs by each artist, along with the corresponding URLs to the song lyrics. Some artists did not have 90 songs on Genius; for three such artists, 60 songs each were retrieved instead. Once again, the BeautifulSoup package was used to scrape the lyrics from the retrieved URLs. Artists with fewer than 30 songs on Genius were dropped from the list. Ultimately, 44 artists with 3870 songs were retained for further analysis. After scraping, all the information, namely the artist's name, each song's URL and each song's lyrics, was saved into a CSV file.
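The final filtering step described above (dropping artists with fewer than 30 songs) can be sketched roughly as follows. The frame, column names and threshold here are toy stand-ins for illustration, not the notebook's actual code.

```python
import pandas as pd

# Toy stand-in for the scraped data: one row per song.
hiphop = pd.DataFrame({
    "Artist": ["A", "A", "B", "B", "B", "C"],
    "lyrics": ["...", "...", "...", "...", "...", "..."],
    "url": ["u1", "u2", "u3", "u4", "u5", "u6"],
})

MIN_SONGS = 2  # the report uses 30; lowered here for the toy data

# Count songs per artist and keep only artists at or above the threshold.
counts = hiphop.groupby("Artist")["url"].transform("size")
hiphop = hiphop[counts >= MIN_SONGS].reset_index(drop=True)
# Artist "C" (one song) is dropped; A and B remain.
```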

Exploratory Analysis and preprocessing

Here are my exploratory analyses and preprocessing steps. A more comprehensive summary of these can be found just before the topic modelling section.

There are 3870 songs with the attributes 'Artist', 'lyrics' and the song 'url'. The first task is to analyse the data set and remove inconsistencies. Inspecting the tail of the data frame shows that there are already some missing values. The first cleaning tasks include:

There are 86 songs with missing lyrics. It does not make sense to keep these for further analysis, so these rows will be dropped.

Inspecting sample lyrics: it seems the songs do not have paragraphs but multiple lines as sentences.

Next, duplicates will be dropped based on the lyrics column. The rows are first checked for duplicates and put into another data frame.

There are 18 duplicate rows. As such, the original hiphop data frame will have 18 rows fewer.
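These two cleaning steps, dropping songs with missing lyrics and then dropping duplicates on the lyrics column, amount to the following pandas calls, shown here on a toy frame rather than the real data:

```python
import pandas as pd

# Toy frame with one missing-lyrics row and one duplicated lyric.
hiphop = pd.DataFrame({
    "Artist": ["A", "A", "B", "C"],
    "lyrics": ["hello world", None, "hello world", "other song"],
})

# Drop songs whose lyrics were not scraped at all.
hiphop = hiphop.dropna(subset=["lyrics"])

# Drop duplicates based on the lyrics column, keeping the first occurrence.
hiphop = hiphop.drop_duplicates(subset=["lyrics"]).reset_index(drop=True)
# Two rows remain: the first "hello world" and "other song".
```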

Next, the lyrics themselves will be examined more closely. There may be a lot of instrumental songs, as hip hop artists often include those on their albums. It could also be that some songs were not scraped correctly, making their value questionable. It thus makes sense to:

Afterwards:

Word counts per song were computed with help from: https://stackoverflow.com/questions/50821143/pandas-dataframe-count-unique-words-in-a-column-and-return-count-in-another-col

Unfortunately, it seems a lot of songs were not scraped correctly and did not return the desired output. From personal knowledge there are indeed some instrumentals. However, also from personal knowledge and closer inspection of the lyrics, something indeed did not get scraped correctly (I tried the process again but got the same output). For now, songs with fewer than 25 words will be dropped; I believe those with more than 25 words can still add value to the analysis.
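The word-count filter described above can be sketched like this, on toy data, keeping the report's threshold of 25 words:

```python
import pandas as pd

hiphop = pd.DataFrame({
    "lyrics": ["word " * 30, "too short", "another " * 40],
})

# Word count per song via simple whitespace tokenisation.
hiphop["word_count"] = hiphop["lyrics"].str.split().str.len()

# Keep only songs with more than 25 words.
hiphop = hiphop[hiphop["word_count"] > 25].reset_index(drop=True)
```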

We can see that Ghostface Killah has the most songs, while Eric B. and Rakim have the fewest songs returned.

I believe it makes sense to add to the stop word list words that often appear in the lyrics, such as 'uh' and 'oh'.

A function removing stop words and punctuation, as well as spaces that can occur at the start or end of a text. This function was taken from https://www.youtube.com/watch?v=dtK7Xhn8XjY&t=129s&ab_channel=SolveBusinessProblemsUsingAnalytics and adapted; I do not claim it as my own.
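The adapted function is not reproduced here; a minimal sketch of the same idea (lowercasing, stripping punctuation, removing stop words and trimming edge whitespace) might look like this, with a small hypothetical stop-word set:

```python
import string

# Hypothetical stop-word set, extended with lyric fillers as described above.
STOPWORDS = {"the", "a", "an", "and", "is", "uh", "oh", "im"}

def clean_lyrics(text):
    """Lowercase, strip punctuation, drop stop words, trim edge whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(tokens).strip()

print(clean_lyrics("Uh, oh! I'm on the mic"))  # → "on mic"
```

Note that stripping punctuation turns "I'm" into "im", which is why 'im' has to appear in the stop-word list, as discussed below.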

Creating a new column with cleaned lyrics

It seems like it worked, but there are still words visible, such as 'im', that should be removed. Therefore, another column is created by removing stop words from the already cleaned clean_lyrics column.

Let's see if it did anything

It seems like removing stop words from the already cleaned lyrics removed further words, such as 'im', visible in row[0]. For now, the clean_lyrics1 column is the more suitable one.

Let's see which artists have the most word counts after the stop word removal.

It seems like there are still a lot of words that do not indicate any topics. Further preprocessing is needed before starting the analysis. I will also append the stop word list with some of these words, such as 'get' and 'got'.

Lemmatization; code inspired by https://www.youtube.com/watch?v=TKjjlp5_r7o&ab_channel=PythonTutorialsforDigitalHumanities
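The spaCy pipeline itself is not reproduced here (it requires a downloaded language model); this sketch shows only the POS filter that keeps nouns, adjectives, verbs and adverbs, applied to (lemma, pos) pairs of the kind spaCy's token.lemma_ and token.pos_ would yield:

```python
# POS tags retained for the analysis, per the preprocessing summary.
ALLOWED_POS = {"NOUN", "ADJ", "VERB", "ADV"}

def keep_content_lemmas(tagged_tokens):
    """Keep only lemmas whose POS tag is in the allowed set."""
    return [lemma for lemma, pos in tagged_tokens if pos in ALLOWED_POS]

# Hypothetical pre-tagged tokens standing in for spaCy output.
tagged = [("run", "VERB"), ("the", "DET"), ("street", "NOUN"),
          ("fast", "ADV"), ("he", "PRON"), ("cold", "ADJ")]
print(keep_content_lemmas(tagged))  # → ['run', 'street', 'fast', 'cold']
```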

To sum up: Exploratory Analyses and preprocessing

Exploratory analyses revealed that, before stop word removal, Kendrick Lamar was the artist with the highest word count, while NWA had the lowest word count among the artists from which 90 songs could be extracted. Interestingly, after stop word removal, Ghostface Killah became the artist with the highest word count. Ghostface Killah and The Fugees were the artists with the highest number of unique words in their lyrics, while rapper 50 Cent had the lowest number of unique words.

The CSV file was then turned into a pandas data frame consisting of 3870 rows and 3 columns. Checking for missing values made clear that the lyrics of 86 songs were not scraped at all. It made no sense to keep those in the data set, so they were dropped. Next, it was checked whether some songs were duplicated; after inspection, 18 rows were dropped as duplicates. A column was then created with the word count per song, to see whether there were songs with a suspiciously low number of words. This could be due to some songs not having been scraped correctly, or to some songs being instrumental versions, and as such not very useful for the analysis. All songs with a word count between 0 and 25 were ultimately dropped from the data set; anything above a word count of 25 still seemed as if it could add to the analysis. Consequently, the final data set used for the analysis consisted of 3433 songs. The next pre-processing step was to remove stop words from the lyrics. Before this was done, the stop word list was extended with a few more items after some inspection of the lyrics: words such as 'la', 'lalala', 'oh' and 'yeah yeah', among others, were added, as these were deemed unnecessary for the analysis. Lastly, the lyrics were lemmatized using the spaCy lemmatizer to group together inflected forms of a word so they can be analysed as a single item, essentially retaining more words while removing different forms of the same word. Only nouns, adjectives, verbs and adverbs were kept for the analysis.

Topic Modelling

To model which topics are found across the whole corpus of song lyrics, a topic modelling approach with latent Dirichlet allocation (LDA) was chosen. This allows us to see which topics are present in which documents, whilst also being able to make connections between words even if they do not appear in the same document (Maier et al., 2018). The analysis started with the count vectorizer, iterating over up to 100 topics. Next, the TfIdf vectorizer was used to perform the same operation on the same number of topics. Afterwards, several runs were performed with different configurations of the alpha hyperparameter. To compare these configurations, a line plot was constructed showing the perplexity and coherence scores. Perplexity is a score of how well the model predicts the word distribution, while coherence is a score of how semantically coherent the topics are (Mimno et al., 2011). A visualization built with the gensim package was also constructed for all models in order to get a better overview of the most relevant terms per topic. These scores indicated that for all models the inflection point was at k=20 topics. Lastly, a model was created taking into account bigrams and trigrams in order to capture words that frequently occur together.
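The notebook's models appear to be built with gensim; as a self-contained stand-in, the same pipeline (count-vectorize the lyrics, fit LDA, score perplexity) can be sketched with scikit-learn on a toy corpus:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned, lemmatized lyrics.
docs = [
    "money dollar bill cash money",
    "street crime murder bullet",
    "money business job dollar",
    "bullet street murder law",
]

# Document-term matrix from the count vectorizer.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

# Fit LDA with k topics; perplexity is one of the model-selection scores.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)
perplexity = lda.perplexity(dtm)
```

Swapping CountVectorizer for TfidfVectorizer reproduces the second configuration described above.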

I will first determine the number of topics I wish to explore. Afterwards, I will iterate over the lemmatized texts and fit the models accordingly. I will start with a model using the Count Vectorizer and inspect the results as well as the visualizations. Next, I will run the same model using the TfIdf vectorizer. Afterwards, I will perform some hyperparameter tuning, construct a text made of bigrams and trigrams, and see whether these operations improve the models.

Lastly, I will discuss all the results.

Now the visualization

From running the first topic model we can see that there are no meaningful conclusions to be drawn yet. The visualization shows a lot of overlapping topics, with the most relevant terms also not giving a good indication of what these topics could be about.

From the perplexity and coherence plot we can also see a clear inflection point at k=20. However, after inspecting the visualization, it seems better to try the TfIdf vectorizer, which transforms our text into vectors weighted by how often each word occurs in a document relative to how many documents it occurs in.

From the visualization we can see that there are now more non-overlapping topics, which is great! Unfortunately, the topics still do not seem to be clear-cut. Let's try tuning the parameters.

Let's plot the coherence and perplexity scores for this model. I will set the number of topics to 15 and vary the alpha parameter to see how this affects our inflection point.
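As a sketch of this sweep, again with scikit-learn as a stand-in for gensim (doc_topic_prior plays the role of alpha here):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus; in the notebook this is the lemmatized lyrics.
docs = ["money dollar bill", "street crime murder",
        "money business job", "bullet street law"] * 3

dtm = CountVectorizer().fit_transform(docs)

# Sweep topic counts and alpha values; in the notebook these scores
# fed the perplexity/coherence line plot.
scores = {}
for k in (2, 3, 4):
    for alpha in (0.1, 0.5, 1.0):
        lda = LatentDirichletAllocation(
            n_components=k, doc_topic_prior=alpha, random_state=0
        ).fit(dtm)
        scores[(k, alpha)] = lda.perplexity(dtm)

best_k, best_alpha = min(scores, key=scores.get)
```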

Changing the alpha parameter did not change the outcome by much. Let's therefore try running a model with 15 topics and the alpha set to auto.

Running the model does not seem to give us a good indication of topics. Topic 5 could be about the most-mentioned areas in rap songs, namely Brooklyn and the Bronx in New York and Compton in south central Los Angeles, given the words 'brooklyn', 'bronx', 'compton' and 'south'.

Next, let's try making some bigrams.
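In gensim this is typically done with the Phrases model; the underlying idea, merging adjacent word pairs that co-occur often enough into single underscore-joined tokens, can be sketched in plain Python as a toy stand-in:

```python
from collections import Counter

def make_bigrams(token_docs, min_count=2):
    """Toy stand-in for gensim's Phrases: merge adjacent pairs occurring
    at least min_count times across the corpus into 'word_word' tokens."""
    pair_counts = Counter(
        pair for doc in token_docs for pair in zip(doc, doc[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for doc in token_docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                out.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs

docs = [["new", "york", "city"], ["new", "york", "state"], ["old", "town"]]
print(make_bigrams(docs))  # 'new york' becomes 'new_york'
```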

Results

Running the LDA topic model on up to 100 topics with the count vectorizer did not provide very good results. In fact, most of the words did not cohere into any clear topics, and there was a lot of overlap between the topics. Before running further topic modelling analyses, I went back and further customized the stop word list to exclude words that carry little meaning, in order to create a model whose words characterize a given topic more clearly. Running the LDA model with the TfIdf vectorizer provided some slightly more promising insights. The metrics showed the weight of all words being much lower than in the model using the count vectorizer. However, a lot of the topics were still largely overlapping; from topic 15 upwards there still seemed to be a lot of overlap. After tuning the alpha hyperparameter, the perplexity and coherence scores indicated the inflection point at k=20 topics. As such, a model was run with the number of topics at k=20, the alpha set to auto and the passes parameter set to 10. After this tuning there were two non-overlapping topics that made sense, as seen in the visualization (vis_data1T).

Topic 2 = Topic about money, keywords: Money, dollar, bill, business, job

Topic 6 = Topic about violence, keywords: Violent, violence, ghetto

Yet, the rest of the topics unfortunately did not produce anything coherent. Next, bigrams and trigrams were introduced, which somewhat successfully combined words such as 'new' and 'york' into 'New York'. However, manual inspection of the metrics revealed few other meaningful combinations. I still decided to plot the coherence and perplexity scores in the hope of improving the model's output. The number of topics ranged from 5 to 30, with varying alpha values. After tuning these hyperparameters, a final model was run on the bigram-and-trigram data with the number of topics at k=15 and the alpha and eta parameters set to auto. Unfortunately, it can be concluded that these configurations did not lead to any coherent topics either. Words in topic 3 of the untuned model, as seen in the visualization, perhaps weakly loaded on a topic of 'crime', containing the words 'law', 'murder' and 'bullet'. Topic 3 also included the words 'New_York' and 'bronx', indicating that perhaps a lot of documents in the corpus were about crime in New York.

Sources

Ballard, M. E., Dodson, A. R., & Bazzini, D. G. (1999). Genre of music and lyrical content: Expectation effects. The Journal of Genetic Psychology, 160(4), 476-487.

Barradas, G. T., & Sakka, L. S. (2021). When words matter: A cross-cultural perspective on lyrics and their relationship to musical emotions. Psychology of Music, 03057356211013390.

Blackman, S. (2014). Subculture theory: An historical and contemporary assessment of the concept for understanding deviance. Deviant behavior, 35(6), 496-512.

Calvert, C., Morehart, E., & Papdelias, S. (2014). Rap music and the true threats quagmire: When does one man's lyric become another's crime. Colum. JL & Arts, 38, 1.

Dunbar, A. (2019). Rap music, race, and perceptions of crime. Sociology Compass, 13(10), e12732.

Kubrin, C. E. (2005). “I see death around the corner”: Nihilism in rap music. Sociological Perspectives, 48(4), 433-459.

Lena, J. C. (2006). Social context and musical content of rap music, 1979–1995. Social Forces, 85(1), 479-495.

MacDonald, R., Kreutz, G., & Mitchell, L. (2012). What is music, health, and wellbeing and why is it important. Music, health, and wellbeing, 3-11

Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., ... & Adam, S. (2018). Applying LDA topic modeling in communication research: Toward a valid and reliable methodology. Communication Methods and Measures, 12(2-3), 93-118.

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing (pp. 262-272).

North, A. C., Hargreaves, D. J., & O'Neill, S. A. (2000). The importance of music to adolescents. British journal of educational psychology, 70(2), 255-272.

Rothbaum, F., & Tsang, B. Y. P. (1998). Lovesongs in the United States and China: On the nature of romantic love. Journal of Cross-Cultural Psychology, 29(2), 306-319.

Stratton, V. N., & Zalanowski, A. H. (1994). Affective impact of music vs. lyrics. Empirical studies of the arts, 12(2), 173-184.

Tyson, E. H. (2002). Hip hop therapy: An exploratory study of a rap music intervention with at-risk and delinquent youth. Journal of Poetry Therapy, 15(3), 131-144.

van Atteveldt, W., Trilling, D., & Arcíla Calderón, C. (2022). Computational Analysis of Communication: A Practical Introduction to the Analysis of Texts, Networks, and Images with Code Examples in Python and R. Wiley, Hoboken, NJ.